# Regular Expressions in Python

<br>Regular expressions are used to search for and match specific patterns of text.<br>
The short form "regex" is also commonly used.<br>
Regular expressions may look complicated, but they are very powerful.<br><br>
NOTE: The examples below mostly use the <b>.finditer</b> method.<br>
There are other useful methods as well, some of which are presented at the end of this document.<br><br>
Python has a built-in module called <b>re</b> for regular expressions.<br>
To use it, simply import it:

In [None]:
import re

The following code example defines a long multi-line text and a shorter, ordinary sentence to test regular expressions with:

In [None]:
import re
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Needs to be escaped):
. ^ $ * + ? { }  [ ] \ ( ) 

taitotalo.fi

010-80-80-90
800-80-80-90
900-80-80-90
010*12*34*56

cat
hat
bat
rat
abc

Dr. Jones
Mr. Anderson
Mrs. Robinson
Mr. Mister
'''

example_sentence = 'And here\'s to you, Mrs. Robinson'

### Raw strings in Python

The text in the code example above lists the metacharacters, which need to be escaped with a <b>\\</b> (backslash) if we want to match them literally.<br>
Backslashes are also special in ordinary Python strings. For example, if we want to print '\n means newline in many programming languages',<br> we have to escape the backslash itself by doubling it:<br>

In [None]:
print('\\n means newline in many programming languages')
print('\n means newline in many programming languages. Here the backslash is not escaped, so a new line is printed first')

However, Python also has raw strings: prefixing a string literal with <b>r</b> makes Python treat backslashes *literally*, so special characters can be written directly:<br>

In [None]:
print(r'\n means newline in many programming languages')
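Raw strings matter for regular expressions because a sequence such as \b means one thing to Python and another to the re module. The following is a minimal sketch of the difference; the sample string 'a cat sat' is made up for this illustration, and the search function it uses is presented at the end of this document:

```python
import re

# In a normal string '\n' is a single newline character;
# in a raw string it stays as two characters: a backslash and 'n'.
normal_length = len('\n')
raw_length = len(r'\n')
print(normal_length, raw_length)

# In a normal string '\b' is a backspace character, so this pattern
# looks for backspace + 'cat' + backspace and finds nothing.
without_raw = re.search('\bcat\b', 'a cat sat')

# In a raw string the backslash reaches the re module untouched,
# so \b is interpreted as a word boundary.
with_raw = re.search(r'\bcat\b', 'a cat sat')

print(without_raw)
print(with_raw)
```

This is why the patterns in this document are written as raw strings.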

### Finding all occurrences of a certain (sub)string in a string:

In [None]:
pattern = re.compile(r'abc')


<b>re.compile(pattern, flags=0)</b><br>
Compile a regular expression pattern into a regular expression object, <br>which can be used for matching using its match(), search() and <br>other methods.<br>
The expression’s behaviour can be modified by specifying a flags value.<br> The value can be any of the module's flag constants (for example re.IGNORECASE or re.MULTILINE), combined using bitwise OR (the | operator).<br>
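As a small sketch of combining flags with the | operator, the following compiles one pattern with two flags. The sample text is an assumption made up for this illustration:

```python
import re

sample = 'abc\nABC\nxabc'

# re.IGNORECASE makes 'abc' also match 'ABC';
# re.MULTILINE makes ^ match at the start of every line,
# not only at the start of the whole string.
pattern = re.compile(r'^abc', re.IGNORECASE | re.MULTILINE)

found = [match.group() for match in pattern.finditer(sample)]
print(found)
```

The third line starts with 'x', so it is not matched even with both flags set.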

In [None]:
#Example of finding all occurrences of 'abc'-word:
import re
pattern = re.compile(r'abc')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match) #Notice case-sensitivity in 'abc',not including 'ABC'


The loop above prints one match object per occurrence. <br>The first one is:<br><br>
<re.Match object; span=(1, 4), match='abc'><br><br>
A second match is printed for the standalone 'abc' line near the end of the text.<br>
The span tells us where the match starts (inclusive) and ends (exclusive) <br>in the searched string.<br>



In [None]:
#We can confirm the statement above:
print(text_to_search[1:4])
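The span does not have to be read from the printed output. As a minimal sketch (the short sample string is made up for this illustration, and the search method is presented at the end of this document), a match object exposes the same information through its own methods:

```python
import re

text = 'xxabcxx'
match = re.compile(r'abc').search(text)

# span() returns the same two numbers as start() and end().
start, end = match.span()
print(start, end)

# group() returns the matched text itself ...
print(match.group())
# ... and slicing the original string with the span gives the same text.
print(text[start:end])
```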

# Python regular expression snippets

.         - Any character except new line<br>
\d        - digit (0-9)<br>
\D        - Not a digit<br>
\w        - Word character (a-z, A-Z, 0-9, _ )<br>
\W        - Not a Word character<br>
\s        - Whitespace (space, tab, newline)<br>
\S        - Not Whitespace<br>
<br>
\b        - Word boundary (the position between a word character and a non-word character, or the start/end of the string)<br>
\B        - Not Word boundary<br>
^         - Beginning of a String<br>
$         - End of a String<br>
<br>
[]        - Matches characters in brackets<br>
[^]       - Matches characters Not in brackets<br>
|         - Either or<br>
( )       - Group<br>
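A few of the snippets above can be tried out on a short string. This is only a sketch: the sample string is made up for this illustration, and it uses re.findall (presented at the end of this document) because it returns plain lists that are easy to read:

```python
import re

sample = 'Tel: 010-80 ok_1'

digits = re.findall(r'\d', sample)         # every single digit
words = re.findall(r'\w+', sample)         # runs of word characters
whitespace = re.findall(r'\s', sample)     # every whitespace character
either = re.findall(r'Tel|ok', sample)     # 'Tel' or 'ok'

print(digits)
print(words)
print(whitespace)
print(either)
```

Note that the underscore counts as a word character, so 'ok_1' is found as one run.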


### Some examples

In [None]:
# Example 1; Desired output is to print all digits:
pattern = re.compile(r'\d')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Example 2: Find out if a sentence starts with word 'And':

In [18]:
pattern = re.compile(r'^And')

matches = pattern.finditer(example_sentence)

for match in matches:
    print(match)

<re.Match object; span=(0, 3), match='And'>


Example 3: Find out if a sentence ends with word 'Robinson':

In [20]:
pattern = re.compile(r'Robinson$')

matches = pattern.finditer(example_sentence)

for match in matches:
    print(match)

<re.Match object; span=(24, 32), match='Robinson'>


Example 4: Find the occurrences of 'Ha' that start at a word boundary.<br>
In 'Ha HaHa' the first two 'Ha' match, but the third does not, because it is preceded by the word character 'a':

In [21]:
pattern = re.compile(r'\bHa')

matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

<re.Match object; span=(67, 69), match='Ha'>
<re.Match object; span=(70, 72), match='Ha'>

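The word boundary behaviour is easier to see on the 'Ha HaHa' line alone. The following sketch compares \b with its opposite \B; the variable names are chosen just for this illustration:

```python
import re

sample = 'Ha HaHa'

# 'Ha' that starts at a word boundary: the first and the second one.
at_boundary = [m.span() for m in re.finditer(r'\bHa', sample)]

# 'Ha' that does NOT start at a word boundary: only the third one,
# because it directly follows the word character 'a'.
not_at_boundary = [m.span() for m in re.finditer(r'\BHa', sample)]

# 'Ha' as a whole word: a boundary is required on both sides.
whole_word = [m.span() for m in re.finditer(r'\bHa\b', sample)]

print(at_boundary)
print(not_at_boundary)
print(whole_word)
```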

### Practical examples, real-life use

The power of regular expressions lies in combining the snippets above.<br>
This way we can build an expression for almost any pattern.<br>

Find all phone numbers in the format xxx-xx-xx-xx (010-80-80-90):<br>
Here the presence of '-' is not strictly checked; <br>
instead, the pattern accepts any character between the digit groups:

In [22]:

pattern = re.compile(r'\d\d\d.\d\d.\d\d.\d\d')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
    
# Note: a more compact version of this pattern is shown below in the
# 'Quantifiers' section.

<re.Match object; span=(157, 169), match='010-80-80-90'>
<re.Match object; span=(170, 182), match='800-80-80-90'>
<re.Match object; span=(183, 195), match='900-80-80-90'>
<re.Match object; span=(196, 208), match='010*12*34*56'>


If we only want to match phone numbers separated by '-' dashes,<br>
we can use a so-called character set, written with [] square brackets.<br>

In [None]:
pattern = re.compile(r'\d\d\d[-]\d\d[-]\d\d[-]\d\d')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

A character set can contain multiple characters, <br>
any one of which is accepted at that position. <br>
For example, to accept <b>*</b> stars in addition to <b>-</b> dashes:

In [None]:
pattern = re.compile(r'\d\d\d[-*]\d\d[-*]\d\d[-*]\d\d')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

# Notice that inside brackets the * does not need a \ backslash

In case we want to parse only phone numbers starting with '800' or '900':

In [None]:
pattern = re.compile(r'[89]00[-]\d\d[-*]\d\d[-*]\d\d')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

### Dash also indicating a range

When a '-' dash is the first or last character in a character set,<br> 
it matches a literal dash, but _between_ two characters<br>
it defines a <b>range</b>:

Here the first digit must be between 6 and 9:

In [None]:
pattern = re.compile(r'[6-9]\d\d[-*]\d\d[-*]\d\d[-*]\d\d')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Here the first letter of the sentence must be between a and i,
or between A and I if it is a capital letter:

In [None]:
pattern = re.compile(r'^[a-iA-I]')
matches = pattern.finditer(example_sentence)

for match in matches:
    print(match)
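The two roles of the dash inside a character set can be compared directly. This is a small sketch with a sample string made up for this illustration; it uses re.findall, which is presented at the end of this document:

```python
import re

sample = 'a-c b'

# Between two characters the dash defines a range: a, b or c.
as_range = re.findall(r'[a-c]', sample)

# At the end of the set the dash is a literal dash: a, c or -.
as_literal = re.findall(r'[ac-]', sample)

print(as_range)
print(as_literal)
```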

### Caret inside character set negates the search

In the previous example, the <b>^</b> caret was outside the character set and, as shown earlier,<br> it anchors the match to the beginning of the string.<br><br>
However, when it is the first character _inside_ a character set, a caret negates the set,<br>
so that the set matches any character that is not listed in it:

Print all characters of example_sentence that are not between a and i or A and I in the alphabet:

In [None]:
pattern = re.compile(r'[^a-iA-I]')
matches = pattern.finditer(example_sentence)

for match in matches:
    print(match)

How to match cat, rat and bat, but not hat:<br>
We negate the character 'h' and require 'at' to follow the first character:

In [None]:
pattern = re.compile(r'[^h]at')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
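One detail worth keeping in mind is that a negated set still has to match some character, and that character can be a space or a newline. The sketch below shows this on a sample string made up for this illustration, and compares it with a set that lists the accepted letters instead:

```python
import re

sample = 'cat hat bat at'

# [^h] accepts ANY character except 'h', including the space
# in front of the last word.
negated = re.findall(r'[^h]at', sample)

# Listing the accepted first letters is stricter.
listed = re.findall(r'[bcr]at', sample)

print(negated)
print(listed)
```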

# Quantifiers

\*       0 or more<br>
\+       1 or more<br>
\?       0 or one<br>
{3}      exact number<br>
{3,4}    range of numbers (min, max)<br>
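Each quantifier in the list above can be tried on a short string. This is a minimal sketch with sample strings made up for this illustration; it uses re.findall, which is presented at the end of this document:

```python
import re

letters = 'a ab abb'
numbers = '12 123 1234'

zero_or_more = re.findall(r'ab*', letters)    # 'a' followed by 0 or more 'b'
one_or_more = re.findall(r'ab+', letters)     # 'a' followed by 1 or more 'b'
zero_or_one = re.findall(r'ab?', letters)     # 'a' followed by 0 or 1 'b'
exactly_three = re.findall(r'\d{3}', numbers)     # exactly 3 digits
three_or_four = re.findall(r'\d{3,4}', numbers)   # 3 to 4 digits

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
print(exactly_three)
print(three_or_four)
```

Note that \d{3} also finds the first three digits of '1234', because the pattern does not say what may come after them.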

Back to our phone number example; we can now specify how many digits we want in each group:

In [None]:
# old way:
#pattern = re.compile(r'\d\d\d.\d\d.\d\d.\d\d')

#with quantifiers:
pattern = re.compile(r'\d{3}.\d{2}.\d{2}.\d{2}')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

Here are some names from our text_to_search string:<br>
Dr. Jones<br>
Mr. Anderson<br>
Mrs. Robinson<br>
Mr. Mister<br>

Let's say we want to match everything that begins with Mr, with or without
a dot after the prefix.<br> 
Now we can use <b>?</b> for '0 or one':

In [None]:
pattern = re.compile(r'Mr\.?') # Notice the \ for . as escape char
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

#Notice that for 'Mrs. Robinson' only the 'Mr' part is matched, not the whole 'Mrs.'

Same as above, but now including also whitespace and capital letter in last name:

In [None]:
pattern = re.compile(r'Mr\.?\s[A-Z]\w*') 
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)

### Groupings

Groups come in handy when we want to use, for example, the <b>|</b> (or) syntax.<br>
Groups are written inside <b>()</b> parentheses.

Include all that begin with Mr, Ms, or Mrs:

In [None]:
pattern = re.compile(r'M(r|s|rs)\.?\s[A-Z]\w*')
#this works also:
#pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')
matches = pattern.finditer(text_to_search)

for match in matches:
    print(match)
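The reason the parentheses are needed can be sketched as follows. Without a group, the | splits the whole pattern into two alternatives; with a group, it only chooses between the prefixes. The sample string and variable names are made up for this illustration:

```python
import re

sample = 'Mr. Smith, Ms. Jones'

# Without parentheses the alternatives are 'Mr' and 'Ms\.\s[A-Z]\w*',
# so the first match is just 'Mr'.
without_group = [m.group() for m in re.finditer(r'Mr|Ms\.\s[A-Z]\w*', sample)]

# With parentheses only the prefix is alternated,
# and the rest of the pattern applies to both.
with_group = [m.group() for m in re.finditer(r'(Mr|Ms)\.\s[A-Z]\w*', sample)]

print(without_group)
print(with_group)
```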

### Exercises

- Create an email validity checker:<br><br>
Here are some valid sample email addresses:<br>
teacher@taitotalo.fi<br>
firstname.lastname@taitotalo.info<br>
ict-teacher-777@taitotalo-server.net<br>

In [None]:
import re
emails = '''
teacher@taitotalo.fi
first.last@taitotalo.info
ict-teacher-777@taito-server.net
'''
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
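The solution above finds email addresses inside a longer text. To check whether one given string is a valid address as a whole, one possible sketch is to use the pattern object's fullmatch method. The function name is_valid_email is hypothetical, and the pattern is the simplified one from this exercise, not a complete check of every legal email address:

```python
import re

# Same simplified pattern as in the exercise solution.
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

def is_valid_email(candidate):
    """Return True if the whole candidate string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None

print(is_valid_email('teacher@taitotalo.fi'))       # complete address
print(is_valid_email('teacher@taitotalo'))          # no dot after the @ part
print(is_valid_email('see teacher@taitotalo.fi'))   # extra text around it
```

Unlike finditer, fullmatch only succeeds when the pattern covers the string from its first character to its last.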

- Create a web page address parser:<br><br>
Valid page address formats:

https://www.google.com<br>
https://youtube.com<br>
https://www.nasa.gov<br>
http://linkedin.com<br>

In [None]:
import re
urls = '''
http://www.google.com
https://youtube.com
http://www.nasa.gov
http://linkedin.com
'''
pattern = re.compile(r'(https?://(www\.)?(\w+)(\.\w+))')
matches = pattern.finditer(urls)
for match in matches:
    print(match)

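The individual parts of a match can be read with <b>.group</b>: group(0) is the whole match, and group(1), group(2), ... are the parenthesized groups in order.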

In [None]:
import re
urls = '''
http://www.google.com
https://youtube.com
http://www.nasa.gov
http://linkedin.com
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)
for match in matches:
    #print(match.group(0)) # 0 here prints everything 
    #print(match.group(1)) #prints www. None www. None
    print(match.group(2)) # prints google youtube nasa linkedin
    #print(match.group(3))  #prints .com .com .gov .com
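All groups of a match can also be read at once. The following sketch uses the same pattern on a single address (chosen for this illustration) together with the match object's groups method:

```python
import re

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
match = pattern.search('https://youtube.com')

# groups() returns every group as a tuple; a group that did not
# take part in the match (here the optional 'www.') is None.
all_groups = match.groups()
print(all_groups)

# group() also accepts several group numbers and returns a tuple.
name_and_domain = match.group(2, 3)
print(name_and_domain)
```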

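There's also a <b>sub</b> method, which replaces every match with a replacement string. In the replacement, \2 and \3 refer to groups 2 and 3, so the same parts can be extracted in one call: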

In [None]:
import re
urls = '''
http://www.google.com
https://youtube.com
http://www.nasa.gov
http://linkedin.com
'''
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls) # google.com youtube.com nasa.gov linkedin.com (one per line)
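The sub method also takes an optional count argument that limits how many matches are replaced. A small sketch, using a one-line sample string made up for this illustration:

```python
import re

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
urls = 'http://www.google.com https://youtube.com'

# Replace every match with groups 2 and 3.
all_replaced = pattern.sub(r'\2\3', urls)

# Replace only the first match; the rest of the string is left as it is.
first_only = pattern.sub(r'\2\3', urls, count=1)

print(all_replaced)
print(first_only)
```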

## Other methods (than .finditer )

Which method to use depends on the purpose.<br><br>
<b>.findall</b> returns the matches as a list of strings instead of match objects (as in .finditer) IF there are no groups.<br>
If there are groups, it returns ONLY the groups: a list of strings for one group, or a list of tuples for several groups:

In [None]:
#Not so usable here, because of groups
import re
urls = '''
http://www.google.com
https://youtube.com
http://www.nasa.gov
http://linkedin.com
'''
pattern = re.compile(r'http://(www\.)(\w+)(\.\w+)')
matches = pattern.findall(urls)
for match in matches:
    print(match)

In [None]:
#More usable here
import re
phone_numbers = '''
010-80-80-90
800-80-80-90
900-80-80-90
010*12*34*56
'''
pattern = re.compile(r'\d{3}.\d{2}.\d{2}.\d{2}')
matches = pattern.findall(phone_numbers)
for match in matches:
    print(match)
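The three possible shapes of a findall result can be put side by side. This is a sketch on a one-line sample string made up for this illustration:

```python
import re

sample = 'http://www.google.com http://www.nasa.gov'

# No groups: a list of the whole matches.
no_groups = re.findall(r'http://www\.\w+\.\w+', sample)

# One group: a list of what that group captured.
one_group = re.findall(r'http://www\.(\w+)\.\w+', sample)

# Several groups: a list of tuples, one tuple per match.
many_groups = re.findall(r'http://(www\.)(\w+)(\.\w+)', sample)

print(no_groups)
print(one_group)
print(many_groups)
```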


<b>.match</b> checks whether the pattern matches _at the beginning_ of a string:<br>
(Notice that the <b>^</b> anchor shown before can also be used for this) 

In [None]:
import re
sentence = "Cat said meow"
pattern = re.compile(r'Cat')
matches = pattern.match(sentence)
print(matches)

<b>.search</b> scans the _entire_ string and returns the _first_ match: <br>
(If nothing is found, it returns None)

In [None]:
import re
sentence = "Cat said meow"
pattern = re.compile(r'meow')
matches = pattern.search(sentence)
print(matches)
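The difference between the two methods shows up when the same pattern is given to both. A minimal sketch with the same sentence:

```python
import re

sentence = 'Cat said meow'
pattern = re.compile(r'meow')

# 'meow' is not at the beginning of the sentence, so match() gives None.
at_start = pattern.match(sentence)

# search() looks through the whole sentence and finds it.
anywhere = pattern.search(sentence)

print(at_start)
print(anywhere.span())

# Both methods return None when nothing matches,
# so the result can be tested directly in an if statement.
if anywhere:
    print('found:', anywhere.group())
```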

To search the sentence regardless of upper/lowercase letters,<br> we can use flags.<br>
There are more flags, but only the ignore-case flag is shown here:

In [None]:
import re
sentence = "Cat said meow"
pattern = re.compile(r'cat',re.IGNORECASE) #or re.I
matches = pattern.search(sentence)
print(matches) #finds Cat
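As one more illustration of flags, the sketch below uses re.DOTALL (or re.S), which lets the . character also match a newline. The sample string is made up for this illustration:

```python
import re

sample = 'first\nsecond'

# By default . matches any character except a newline,
# so the pattern cannot cross the line break.
without_flag = re.search(r'first.second', sample)

# With re.DOTALL the . also matches the newline.
with_flag = re.search(r'first.second', sample, re.DOTALL)

print(without_flag)
print(with_flag is not None)
```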