## Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition to text-matching and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions.

If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python.

## Searching for Patterns in Text

One of the most common uses for the re module is for finding patterns in text. Let's do a quick example of using the search method in the re module to find some text:

In [3]:
import re
# List of patterns to search for
patterns = [ 'term1', 'term2' ]
# Text to parse
text = 'This is a string with term1, but it does not have the other term.'
for pattern in patterns:
    print('Searching for "%s" in: \n"%s"' % (pattern, text))
    if re.search(pattern,  text):
        print('Match was found.')
    else:
        print('No Match was found.')

Searching for "term1" in: 
"This is a string with term1, but it does not have the other term."
Match was found.
Searching for "term2" in: 
"This is a string with term1, but it does not have the other term."
No Match was found.


Now we've seen that re.search() takes the pattern, scans the text, and returns a **Match** object for the first occurrence. If the pattern is not found, **None** is returned instead. To get a clearer picture of this match object, check out the cell below:

In [4]:
# Pattern to search for
pattern = 'term1'
# Text to parse
text = 'This is a string with term1, but it does not have the other term.'
match = re.search(pattern,  text)
print(type(match), match.start(), match.end())
text = 'This is a string with a TERM1, ...'
match = re.search(pattern,  text, re.IGNORECASE)
print(match.start())

<class 're.Match'> 22 27
24
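Because re.search() can return None, calling a method such as start() on the result fails when nothing matches. The sketch below shows one way to guard against that; the helper name describe_match is hypothetical and only used for illustration.

```python
import re

def describe_match(pattern, text, flags=0):
    """Return a short description of the first match, or a note if there is none."""
    match = re.search(pattern, text, flags)
    if match is None:
        return 'no match'
    # group() is the matched text; span() is the (start, end) pair
    return '%r at %s' % (match.group(), match.span())

text = 'This is a string with term1, but it does not have the other term.'
found = describe_match('term1', text)
missing = describe_match('term2', text)
print(found)    # 'term1' at (22, 27)
print(missing)  # no match
```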


In [5]:
import re
line = "Cats are smarter than dogs"
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter
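The cell above uses re.match() rather than re.search(). As a minimal sketch of the difference: re.match() only succeeds when the pattern matches at the start of the string, while re.search() scans for the first match anywhere. The variable names here are illustrative.

```python
import re

line = "Cats are smarter than dogs"

# re.match only succeeds if the pattern matches at the very start of the string
at_start = re.match(r'dogs', line)
# re.search scans forward and finds the first match anywhere
anywhere = re.search(r'dogs', line)

print(at_start)          # None
print(anywhere.group())  # dogs

# groups() returns all captured groups as a tuple
groups = re.match(r'(.*) are (.*?) .*', line).groups()
print(groups)            # ('Cats', 'smarter')
```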


## Split with regular expressions

Let's see how we can split with the re syntax. This should look similar to how you used the split() method with strings.

In [6]:
# Term to split on
split_term = '@'
phrase = 'What is the domain name of someone with the email: hello@gmail.com'
# Split the phrase
re.split(split_term,phrase)

['What is the domain name of someone with the email: hello', 'gmail.com']

Note how re.split() returns a list with the term to split on removed; the items in the list are the pieces of the original string.
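Where re.split() goes beyond the string split() method is that the separator can be a pattern. The following is a small sketch with a made-up record string; it uses the bracket (character set) syntax covered later in this lecture, and the maxsplit keyword to limit the number of splits.

```python
import re

# Hypothetical input with inconsistent separators
record = 'apples, oranges;pears   bananas'
# Split on any run of commas, semicolons, or whitespace
fields = re.split(r'[,;\s]+', record)
print(fields)  # ['apples', 'oranges', 'pears', 'bananas']

# maxsplit=1 stops after the first split
user, domain = re.split('@', 'hello@gmail.com', maxsplit=1)
print(domain)  # gmail.com
```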

## Finding all instances of a pattern

You can use re.findall() to find all the instances of a pattern in a string. For example:

In [7]:
# findall returns a list of all matching strings
x = re.findall('match', 'test phrase match is in match middle')
print(x)
# finditer returns an iterator that yields a Match object for each match
x = re.finditer('match', 'test phrase match is in match middle')
print(x)

['match', 'match']
<callable_iterator object at 0x0000023075AF1130>
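Printing the result of re.finditer() only shows the iterator itself. To see what it holds, loop over it; each item is a Match object, so the position of every match is available. A minimal sketch:

```python
import re

phrase = 'test phrase match is in match middle'
# Collect the matched text together with its start and end index
spans = [(m.group(), m.start(), m.end()) for m in re.finditer('match', phrase)]
print(spans)  # [('match', 12, 17), ('match', 24, 29)]
```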


## Pattern re Syntax

This will be the bulk of this lecture on using re with Python. Regular expressions support a huge variety of patterns, well beyond finding where a single fixed string occurs.

We can use *metacharacters* along with re to find specific types of patterns. 

Since we will be testing multiple re syntax forms, let's create a function that will print out results given a list of various regular expressions and a phrase to parse:

In [1]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %r' %pattern)
        print(re.findall(pattern,phrase))

## Repetition Syntax

There are five ways to express repetition in a pattern:

    1.) A pattern followed by the meta-character * is repeated zero or more times.
    2.) Replace the * with + and the pattern must appear at least once.
    3.) Using ? means the pattern appears zero or one time.
    4.) For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat.
    5.) Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value appears at least m times, with no maximum.
    
Now we will see an example of each of these using our multi_re_find function:

In [8]:
test_phrase = 'sdsdsdd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ 'sd*',     # s followed by zero or more d's
                'sd+',          # s followed by one or more d's
                'sd?',          # s followed by zero or one d's
                'sd{3}',        # s followed by three d's
                'sd{2,3}',      # s followed by two to three d's
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 'sdd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']
Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sdd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']
Searching the phrase using the re check: 'sd?'
['sd', 'sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']
Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']
Searching the phrase using the re check: 'sd{2,3}'
['sdd', 'sddd', 'sddd', 'sddd', 'sddd']
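The open-ended form {m,} from the list of repetition rules is not exercised in the cell above. A short sketch using the same test phrase:

```python
import re

test_phrase = 'sdsdsdd..sssddd...sdddsddd...dsds...dsssss...sdddd'
# s followed by two or more d's, with no upper limit
two_or_more = re.findall('sd{2,}', test_phrase)
print(two_or_more)  # ['sdd', 'sddd', 'sddd', 'sddd', 'sdddd']
```

Compared with 'sd{2,3}', the last match now includes all four d's of 'sdddd'.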


## Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character sets. For example, the pattern [ab] matches a single occurrence of either a or b.
Let's see some examples:

In [9]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
test_patterns = [ '[sd]',    # either s or d
            's[sd]+']   # s followed by one or more s or d
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']
Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']


## Exclusion

We can use ^ to exclude characters by placing it first inside the bracket notation. Let's see some examples:

Use [^!.? ] to match any character that is not a !, ., ?, or space. Adding + requires one or more such characters in a row, which basically translates to finding the words.

In [29]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
print(re.findall('[^!.? ]+',test_phrase))

['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'it']
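The test phrase asks how to remove the punctuation. One possible answer, sketched here with re.sub() (shown in more detail later in this lecture), is to replace each punctuation character with an empty string. The second part illustrates that ^ only negates a set when it is the first character inside the brackets.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
# Replace every !, . or ? with an empty string
no_punctuation = re.sub('[!.?]', '', test_phrase)
print(no_punctuation)

# When ^ is not first inside the brackets, it is just a literal caret
literal_caret = re.findall('[.^]', 'a^b.c')
print(literal_caret)  # ['^', '.']
```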


## Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].

A common use case is to search for a specific range of letters in the alphabet. For example, [a-f] matches any single letter between a and f.

Let's walk through some examples:

In [60]:

test_phrase = 'This is an example sentence. Lets see if we find some letters.'
test_patterns=[ '[a-z]+',     # sequences of lower case letters
                '[A-Z]+',     # sequences of upper case letters
                '[a-zA-Z]+',  # sequences of lower or upper case letters
                '[A-Z][a-z]+']# 1 uppercase letter followed by lowercase letters
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[a-z]+'
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'find', 'some', 'letters']
Searching the phrase using the re check: '[A-Z]+'
['T', 'L']
Searching the phrase using the re check: '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'find', 'some', 'letters']
Searching the phrase using the re check: '[A-Z][a-z]+'
['This', 'Lets']
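Ranges also work for digits, and several ranges can be combined in one set. As a sketch with a made-up input string, the pattern below finds six-digit hexadecimal color codes. It also shows that a hyphen placed last in the set is treated as a literal hyphen rather than a range marker.

```python
import re

# Hypothetical text containing two valid hex colors and one invalid one
text = 'Colors: #FF8800, #00ff7f and #zzz999'
hex_colors = re.findall('#[0-9a-fA-F]{6}', text)
print(hex_colors)  # ['#FF8800', '#00ff7f']

# A trailing hyphen inside the brackets is a literal '-'
letters_or_hyphen = re.findall('[a-c-]', 'a-d')
print(letters_or_hyphen)  # ['a', '-']
```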


## Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and the underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash (\\). Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with r, for creating regular expressions eliminates this problem and maintains readability.
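A minimal sketch of why raw strings help: the two spellings of the digit pattern below are the same string, but the raw form is easier to read. The word-boundary escape \b shows the risk more clearly, since in a normal string literal it becomes a backspace character.

```python
import re

# In a normal string literal the backslash must be doubled
escaped = '\\d+'
# A raw string keeps the backslash exactly as written
raw = r'\d+'
print(escaped == raw)  # True

# '\b' in a normal string is one backspace character, not a word boundary
print(len('\b'), len(r'\b'))  # 1 2

# With a raw string, \b reaches the regex engine as a word boundary
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')
print(whole_words)  # ['cat', 'cat']
```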

In [62]:
test_phrase = 'This is  a string with some numbers 1233 a45 and a symbol #hashtag 13'
test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric
                ]
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '\\d+'
['1233', '45', '13']
Searching the phrase using the re check: '\\D+'
['This is  a string with some numbers ', ' a', ' and a symbol #hashtag ']
Searching the phrase using the re check: '\\s+'
[' ', '  ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
Searching the phrase using the re check: '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'a45', 'and', 'a', 'symbol', '#hashtag', '13']
Searching the phrase using the re check: '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'a45', 'and', 'a', 'symbol', 'hashtag', '13']
Searching the phrase using the re check: '\\W+'
[' ', '  ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #', ' ']


In [65]:
import re
phone = "2004-959-559 # This is Phone Number"
# Delete Python-style comments
num = re.sub(r'#.*', "", phone)
print("Phone Num : ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone)    
print("Phone Num : ", num)

Phone Num :  2004-959-559 
Phone Num :  2004959559
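The cell above uses re.sub() to replace every match with a fixed string. Two further options of re.sub() may be worth knowing: the count argument limits how many replacements are made, and the replacement can be a function that receives each Match object. The sample sentence in the second call is made up for illustration.

```python
import re

phone = "2004-959-559 # This is Phone Number"
# count=1 replaces only the first hyphen
first_only = re.sub(r'-', ' ', phone, count=1)
print(first_only)  # 2004 959-559 # This is Phone Number

# A function can compute each replacement from its match object
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), '3 apples and 10 pears')
print(doubled)  # 6 apples and 20 pears
```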


In [37]:
# Import only the functions used in the cells below
from re import sub, finditer, split
t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

In [70]:
sub(r'(\w*)o(\w*)', r'=\1o\2=', t)

'The quick =brown= =fox= =born= =on= 1/23/2013 jumped =over= the lazy =dog= =born= =on= 10/6/10.'

In [15]:
sub(r'(?P<firstPartOfWord>\w*)o(?P<after>\w*)', r'=\g<firstPartOfWord>o\g<after>=', t)

'The quick =brown= =fox= =born= =on= 1/23/2013 jumped =over= the lazy =dog= =born= =on= 10/6/10.'

In [16]:
sub(r'(?x) (?P<before> \w*) o (?P<after> \w*)', r'=\g<before>o\g<after>=', t)

'The quick =brown= =fox= =born= =on= 1/23/2013 jumped =over= the lazy =dog= =born= =on= 10/6/10.'
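Named groups are also useful outside of sub(). As a sketch, the pattern below picks the dates out of the same sentence and reads the parts back by name with group() and groupdict(). The group names month, day, and year are assumptions about how the dates are written, not something the text states.

```python
import re

t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'
# Assumed month/day/year order; the names are illustrative
date_pattern = r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)'

first = re.search(date_pattern, t)
print(first.group('year'))  # 2013

# groupdict() maps each group name to the text it captured
dates = [m.groupdict() for m in re.finditer(date_pattern, t)]
print(dates)
```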

In [72]:
sub(r'(?x) (\w*o\w*)', r'=\1=', t)

'The quick =brown= =fox= =born= =on= 1/23/2013 jumped =over= the lazy =dog= =born= =on= 10/6/10.'
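The (?x) prefix in the cells above turns on verbose mode, in which whitespace inside the pattern is ignored. The same mode can be requested with the re.VERBOSE flag, which also allows comments inside the pattern. A minimal sketch that reproduces the first substitution:

```python
import re

t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'
word_with_o = re.compile(r'''
    (\w*)   # any word characters before the o
    o       # a literal o
    (\w*)   # any word characters after the o
''', re.VERBOSE)

result = word_with_o.sub(r'=\1o\2=', t)
print(result)
```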

In [18]:
fi = finditer(r'(?x) / (.* ) /', t)

[m.group(1) for m in fi]

['23/2013 jumped over the lazy dog born on 10/6']

In [19]:
fi = finditer(r'(?x) / (.*? ) /', t)

[m.group(1) for m in fi]

['23', '6']

In [20]:
fi = finditer(r'(?x) ([aeiou]) .* \1', t)

[m.group(0) for m in fi]

['e quick brown fox born on 1/23/2013 jumped over the', 'og born o']

In [21]:
fi = finditer(r'(?x) ([aeiou]) .*? \1', t)

[m.group(0) for m in fi]

['e quick brown fox born on 1/23/2013 jumpe', 'over the lazy do', 'orn o']
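The last four cells contrast greedy repetition (.*) with its non-greedy form (.*?). Greedy repetition consumes as much text as it can while still allowing a match; adding ? makes it stop as early as possible. A smaller sketch of the same effect, using a made-up sentence with quoted words:

```python
import re

sentence = 'say "hi" and "bye"'
# Greedy: runs from the first quote to the last quote
greedy = re.findall(r'".*"', sentence)
# Non-greedy: stops at the nearest closing quote
lazy = re.findall(r'".*?"', sentence)
print(greedy)  # ['"hi" and "bye"']
print(lazy)    # ['"hi"', '"bye"']
```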

In [22]:
split(r'\w+u\w+', t)

['The ',
 ' brown fox born on 1/23/2013 ',
 ' over the lazy dog born on 10/6/10.']

In [23]:
split(r'(\w+u\w+)', t)

['The ',
 'quick',
 ' brown fox born on 1/23/2013 ',
 'jumped',
 ' over the lazy dog born on 10/6/10.']

In [24]:
split(r'(\w*o\w*)', t)

['The quick ',
 'brown',
 ' ',
 'fox',
 ' ',
 'born',
 ' ',
 'on',
 ' 1/23/2013 jumped ',
 'over',
 ' the lazy ',
 'dog',
 ' ',
 'born',
 ' ',
 'on',
 ' 10/6/10.']

Here we split on 4-character words (\w also matches digits, so 2013 counts). Then we do it again and capture the 4-character words.

In [25]:

split(r'\b\w{4}\b', t)


['The quick brown fox ',
 ' on 1/23/',
 ' jumped ',
 ' the ',
 ' dog ',
 ' on 10/6/10.']

In [26]:

split(r'\b(\w{4})\b', t)


['The quick brown fox ',
 'born',
 ' on 1/23/',
 '2013',
 ' jumped ',
 'over',
 ' the ',
 'lazy',
 ' dog ',
 'born',
 ' on 10/6/10.']
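When the split pattern is wrapped in a capturing group, the separators are kept in the result, so nothing from the original string is lost. The sketch below checks that by joining the pieces back together; the captured separators sit at the odd indices of the list.

```python
import re

t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'
pieces = re.split(r'\b(\w{4})\b', t)

# Joining the pieces reproduces the original string
rebuilt = ''.join(pieces)
print(rebuilt == t)  # True

# The captured separators are at odd positions
four_char_words = pieces[1::2]
print(four_char_words)  # ['born', '2013', 'over', 'lazy', 'born']
```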